Method and apparatus for video coding using matrix based cross-component prediction

ABSTRACT

A method and an apparatus for video coding using a matrix-based cross-component prediction are disclosed. The video coding method and apparatus predict a chroma component of a current block, by using a deep learning-based matrix operation, from a chroma component spatially adjacent to a chroma block of the current block and from a luma component spatially adjacent to a luma block corresponding to the chroma block.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of International Application No. PCT/KR2022/003214 filed on Mar. 7, 2022, which claims priority to Korean Patent Application No. 10-2021-0030284 filed on Mar. 8, 2021, and Korean Patent Application No. 10-2022-0028498 filed on Mar. 7, 2022, the entire disclosures of which are incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates to a video coding method and an apparatus using a matrix-based cross-component prediction.

BACKGROUND

The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

Since video data has a large amount of data compared to audio or still image data, the video data requires a lot of hardware resources, including memory, to store or transmit the video data without processing for compression.

Accordingly, an encoder is generally used to compress and store or transmit video data. A decoder receives the compressed video data, decompresses the received compressed video data, and plays the decompressed video data. Video compression techniques include H.264/AVC, High Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC), which has improved coding efficiency by about 30% or more compared to HEVC.

However, since the image size, resolution, and frame rate gradually increase, the amount of data to be encoded also increases. Accordingly, a new compression technique providing higher coding efficiency and an improved image enhancement effect than existing compression techniques is required.

Recently, deep learning-based image processing techniques have been applied to existing encoding elemental technologies. Coding efficiency can be improved by applying deep learning-based image processing techniques to existing encoding techniques, in particular, to compression techniques such as inter prediction, intra prediction, the in-loop filter, transform, etc. Representative application examples include inter prediction based on virtual reference frames generated by deep learning models and an in-loop filter based on denoising models. Therefore, deep learning-based image processing technology needs to be employed further to improve the coding efficiency in image encoding/decoding.

SUMMARY

The present disclosure in some embodiments seeks to provide a video coding method and an apparatus for predicting a chroma component of a current block using a luma component. The video coding method and apparatus predict the chroma component of the current block, by using a deep learning-based matrix operation, from a chroma component spatially adjacent to a chroma block of the current block and from a luma component spatially adjacent to a luma block corresponding to the chroma block.

At least one aspect of the present disclosure provides a method performed by a computing device for predicting a chroma component of a current block using a luma component. The method comprises obtaining reference pixels that include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block. The method also comprises generating an input block formed as a one-dimensional (1D) vector or a two-dimensional (2D) vector by rearranging the reference pixels. The method also comprises generating a chroma prediction block of the current block by inputting the input block into an estimating model that is a deep learning-based model.

Another aspect of the present disclosure provides a cross-component prediction device for predicting a chroma component of a current block by using a luma component. The device comprises an input unit configured to obtain reference pixels that include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block. The device also comprises a preprocessor configured to generate an input block formed as a one-dimensional (1D) vector or a two-dimensional (2D) vector by rearranging the reference pixels. The device also comprises an estimator comprising an estimating model that is a deep learning-based model and configured to generate a chroma prediction block of the current block by inputting the input block into the estimating model.

Yet another aspect of the present disclosure provides a method performed by a computing device for predicting a chroma component of a current block using a luma component. The method comprises obtaining reference pixels and reconstructed pixels. The reference pixels include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block. The reconstructed pixels represent reconstructed pixels of the luma block. The method also comprises generating an input block formed as a one-dimensional (1D) vector or a two-dimensional (2D) vector by rearranging the reference pixels and the reconstructed pixels. The method also comprises generating a chroma prediction block of the current block by inputting the input block into an estimating model that is a deep learning-based model.

As described above, the present disclosure provides a video coding method and an apparatus for predicting a chroma component of a current block, by using a deep learning-based matrix operation, from a chroma component spatially adjacent to a chroma block of the current block and from a luma component spatially adjacent to a luma block corresponding to the chroma block, to improve the coding efficiency of the chroma component of the current block.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a video encoding apparatus that may implement the techniques of the present disclosure.

FIG. 2 illustrates a method for partitioning a block using a quadtree plus binarytree ternarytree (QTBTTT) structure.

FIGS. 3A and 3B illustrate a plurality of intra prediction modes including wide-angle intra prediction modes.

FIG. 4 illustrates neighboring blocks of a current block.

FIG. 5 is a block diagram of a video decoding apparatus that may implement the techniques of the present disclosure.

FIG. 6 is a diagram illustrative of neighboring pixels referenced for cross-component prediction.

FIG. 7 is an example diagram conceptually illustrating the derivation of a linear model for cross-component prediction.

FIG. 8 is an example diagram conceptually illustrating a matrix-based cross-component prediction device, according to at least one embodiment of the present disclosure.

FIG. 9 is an example diagram illustrating preprocessing of reference pixels, according to at least one embodiment of the present disclosure.

FIG. 10 is an example diagram conceptually illustrating a matrix-based cross-component prediction device, according to another embodiment of the present disclosure.

FIG. 11 is an example diagram illustrating a reduced chroma prediction block, according to at least one embodiment of the present disclosure.

FIG. 12 is an example diagram conceptually illustrating a cross-component prediction device that further utilizes reconstructed luma pixels, according to another embodiment of the present disclosure.

FIG. 13 is an example diagram conceptually illustrating a cross-component prediction device further utilizing reconstructed luma pixels, according to yet another embodiment of the present disclosure.

FIG. 14 is a flowchart of a cross-component prediction method, according to at least one embodiment of the present disclosure.

FIG. 15 is a flowchart of a cross-component prediction method, according to another embodiment of the present disclosure.

FIG. 16 is a flowchart of a cross-component prediction method further utilizing reconstructed pixels in a luma block, according to at least one embodiment of the present disclosure.

FIG. 17 is a flowchart of a cross-component prediction method that further utilizes reconstructed pixels of the luma block, according to another embodiment of the present disclosure.

DETAILED DESCRIPTION

Hereinafter, some embodiments of the present disclosure are described in detail with reference to the accompanying illustrative drawings. In the following description, like reference numerals designate like elements, although the elements are shown in different drawings. Further, in the following description of some embodiments, detailed descriptions of related known components and functions have been omitted for the purpose of clarity and brevity when such descriptions would obscure the subject of the present disclosure.

FIG. 1 is a block diagram of a video encoding apparatus that may implement technologies of the present disclosure. Hereinafter, referring to the illustration of FIG. 1, the video encoding apparatus and components of the apparatus are described.

The encoding apparatus may include a picture splitter 110, a predictor 120, a subtractor 130, a transformer 140, a quantizer 145, a rearrangement unit 150, an entropy encoder 155, an inverse quantizer 160, an inverse transformer 165, an adder 170, a loop filter unit 180, and a memory 190.

Each component of the encoding apparatus may be implemented as hardware or software or implemented as a combination of hardware and software. Further, a function of each component may be implemented as software, and a microprocessor may also be implemented to execute the function of the software corresponding to each component.

One video is constituted by one or more sequences including a plurality of pictures. Each picture is split into a plurality of areas, and encoding is performed for each area. For example, one picture is split into one or more tiles or/and slices. Here, one or more tiles may be defined as a tile group. Each tile or/and slice is split into one or more coding tree units (CTUs). In addition, each CTU is split into one or more coding units (CUs) by a tree structure. Information applied to each CU is encoded as a syntax of the CU, and information commonly applied to the CUs included in one CTU is encoded as the syntax of the CTU. Further, information commonly applied to all blocks in one slice is encoded as the syntax of a slice header, and information applied to all blocks constituting one or more pictures is encoded to a picture parameter set (PPS) or a picture header. Furthermore, information which the plurality of pictures commonly refers to is encoded to a sequence parameter set (SPS). In addition, information which one or more SPSs commonly refer to is encoded to a video parameter set (VPS). Further, information commonly applied to one tile or tile group may also be encoded as the syntax of a tile or tile group header. The syntaxes included in the SPS, the PPS, the slice header, the tile, or the tile group header may be referred to as a high level syntax.

The picture splitter 110 determines a size of a coding tree unit (CTU). Information on the size of the CTU (CTU size) is encoded as the syntax of the SPS or the PPS and delivered to a video decoding apparatus.

The picture splitter 110 splits each picture constituting the video into a plurality of coding tree units (CTUs) having a predetermined size and then recursively splits the CTU by using a tree structure. A leaf node in the tree structure becomes the coding unit (CU), which is a basic unit of encoding.

The tree structure may be a quadtree (QT) in which a higher node (or a parent node) is split into four lower nodes (or child nodes) having the same size. The tree structure may also be a binarytree (BT) in which the higher node is split into two lower nodes. The tree structure may also be a ternarytree (TT) in which the higher node is split into three lower nodes at a ratio of 1:2:1. The tree structure may also be a structure in which two or more structures among the QT structure, the BT structure, and the TT structure are mixed. For example, a quadtree plus binarytree (QTBT) structure may be used, or a quadtree plus binarytree ternarytree (QTBTTT) structure may be used. Here, the BT and the TT added to the tree structure are collectively referred to as a multiple-type tree (MTT).

FIG. 2 is a diagram for describing a method for splitting a block by using a QTBTTT structure.

As illustrated in FIG. 2, the CTU may first be split into the QT structure. Quadtree splitting may be recursive until the size of a splitting block reaches a minimum block size (MinQTSize) of the leaf node permitted in the QT. A first flag (QT_split_flag) indicating whether each node of the QT structure is split into four nodes of a lower layer is encoded by the entropy encoder 155 and signaled to the video decoding apparatus. When the leaf node of the QT is not larger than a maximum block size (MaxBTSize) of a root node permitted in the BT, the leaf node may be further split into at least one of the BT structure or the TT structure. A plurality of split directions may be present in the BT structure and/or the TT structure. For example, there may be two directions, i.e., a direction in which the block of the corresponding node is split horizontally and a direction in which the block of the corresponding node is split vertically. As illustrated in FIG. 2, when the MTT splitting starts, a second flag (mtt_split_flag) indicating whether the nodes are split, a flag additionally indicating the split direction (vertical or horizontal) if the nodes are split, and/or a flag indicating the split type (binary or ternary) are encoded by the entropy encoder 155 and signaled to the video decoding apparatus.

Alternatively, prior to encoding the first flag (QT_split_flag) indicating whether each node is split into four nodes of the lower layer, a CU split flag (split_cu_flag) indicating whether the node is split may also be encoded. When a value of the CU split flag (split_cu_flag) indicates that each node is not split, the block of the corresponding node becomes the leaf node in the split tree structure and becomes the CU, which is the basic unit of encoding. When the value of the CU split flag (split_cu_flag) indicates that each node is split, the video encoding apparatus starts encoding the first flag by the above-described scheme.
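
For illustration, the following Python sketch summarizes this flag order at a single node of the tree. The read_flag bitstream reader and the simplified control flow are assumptions for exposition, not the complete VVC parsing process.

    def parse_node_split(read_flag):
        # read_flag(name) -> 0 or 1, read from the bitstream (hypothetical reader)
        if not read_flag("split_cu_flag"):      # node is not split: it becomes a CU
            return "leaf (CU)"
        if read_flag("QT_split_flag"):          # first flag: split into four QT children
            return "quadtree split into 4 lower nodes"
        # MTT flags follow once quadtree splitting has ended at this node
        direction = "vertical" if read_flag("mtt_split_vertical_flag") else "horizontal"
        split_type = "binary" if read_flag("mtt_split_binary_flag") else "ternary"
        return f"{split_type} {direction} split"

    bits = iter([1, 0, 1, 0])                   # example flag values
    print(parse_node_split(lambda name: next(bits)))   # -> "ternary vertical split"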

When the QTBT is used as another example of the tree structure, there may be two types, i.e., a type (i.e., symmetric horizontal splitting) in which the block of the corresponding node is horizontally split into two blocks having the same size and a type (i.e., symmetric vertical splitting) in which the block of the corresponding node is vertically split into two blocks having the same size. A split flag (split_flag) indicating whether each node of the BT structure is split into the block of the lower layer and split type information indicating a splitting type are encoded by the entropy encoder 155 and delivered to the video decoding apparatus. Meanwhile, a type in which the block of the corresponding node is split into two blocks of a form of being asymmetrical to each other may be additionally present. The asymmetrical form may include a form in which the block of the corresponding node is split into two rectangular blocks having a size ratio of 1:3 or may also include a form in which the block of the corresponding node is split in a diagonal direction.

The CU may have various sizes according to QTBT or QTBTTT splitting from the CTU. Hereinafter, a block corresponding to a CU (i.e., the leaf node of the QTBTTT) to be encoded or decoded is referred to as a “current block”. As the QTBTTT splitting is adopted, a shape of the current block may also be a rectangular shape in addition to a square shape.

The predictor 120 predicts the current block to generate a prediction block. The predictor 120 includes an intra predictor 122 and an inter predictor 124.

In general, each of the current blocks in the picture may be predictively coded. In general, the prediction of the current block may be performed by using an intra prediction technology (using data from the picture including the current block) or an inter prediction technology (using data from a picture coded before the picture including the current block). The inter prediction includes both unidirectional prediction and bidirectional prediction.

The intra predictor 122 predicts pixels in the current block by using pixels (reference pixels) positioned on a neighbor of the current block in the current picture including the current block. There is a plurality of intra prediction modes according to the prediction direction. For example, as illustrated in FIG. 3A, the plurality of intra prediction modes may include 2 non-directional modes including a Planar mode and a DC mode and may include 65 directional modes. A neighboring pixel and an arithmetic equation to be used are defined differently according to each prediction mode.

For efficient directional prediction for the current block having a rectangular shape, directional modes (#67 to #80, intra prediction modes #-1 to #-14) illustrated as dotted arrows in FIG. 3B may be additionally used. The directional modes may be referred to as “wide angle intra-prediction modes”. In FIG. 3B, the arrows indicate corresponding reference samples used for the prediction and do not represent the prediction directions. The prediction direction is opposite to a direction indicated by the arrow. When the current block has the rectangular shape, the wide angle intra-prediction modes are modes in which the prediction is performed in an opposite direction to a specific directional mode without additional bit transmission. In this case, among the wide angle intra-prediction modes, some wide angle intra-prediction modes usable for the current block may be determined by a ratio of a width and a height of the current block having the rectangular shape. For example, when the current block has a rectangular shape in which the height is smaller than the width, wide angle intra-prediction modes (intra prediction modes #67 to #80) having an angle smaller than 45 degrees are usable. When the current block has a rectangular shape in which the height is larger than the width, the wide angle intra-prediction modes having an angle larger than −135 degrees are usable.
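
The aspect-ratio rule above may be summarized by the following Python sketch; the function name is hypothetical, and the mode ranges follow the description above.

    def usable_wide_angle_modes(width, height):
        # Wide blocks (width > height): modes #67..#80, angles beyond 45 degrees.
        # Tall blocks (height > width): modes #-1..#-14, angles beyond -135 degrees.
        if width > height:
            return list(range(67, 81))
        if height > width:
            return list(range(-1, -15, -1))
        return []                      # square blocks use no wide-angle modes

    print(usable_wide_angle_modes(16, 4))   # -> [67, 68, ..., 80]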

The intra predictor 122 may determine an intra prediction mode to be used for encoding the current block. In some examples, the intra predictor 122 may encode the current block by using multiple intra prediction modes and also select an appropriate intra prediction mode to be used from tested modes. For example, the intra predictor 122 may calculate rate-distortion values by using a rate-distortion analysis for multiple tested intra prediction modes and also select an intra prediction mode having best rate-distortion features among the tested modes.

The intra predictor 122 selects one intra prediction mode among a plurality of intra prediction modes and predicts the current block by using a neighboring pixel (reference pixel) and an arithmetic equation determined according to the selected intra prediction mode. Information on the selected intra prediction mode is encoded by the entropy encoder 155 and delivered to the video decoding apparatus.

The inter predictor 124 generates the prediction block for the current block by using a motion compensation process. The inter predictor 124 searches a block most similar to the current block in a reference picture encoded and decoded earlier than the current picture and generates the prediction block for the current block by using the searched block. In addition, a motion vector (MV) is generated, which corresponds to a displacement between the current block in the current picture and the prediction block in the reference picture. In general, motion estimation is performed for a luma component, and a motion vector calculated based on the luma component is used for both the luma component and a chroma component. Motion information including information on the reference picture and information on the motion vector used for predicting the current block is encoded by the entropy encoder 155 and delivered to the video decoding apparatus.

The inter predictor 124 may also perform interpolation for the reference picture or a reference block in order to increase accuracy of the prediction. In other words, sub-samples between two contiguous integer samples are interpolated by applying filter coefficients to a plurality of contiguous integer samples including the two integer samples. When a process of searching a block most similar to the current block is performed for the interpolated reference picture, the motion vector may be expressed with decimal-unit precision rather than integer-sample-unit precision. Precision or resolution of the motion vector may be set differently for each target area to be encoded, e.g., a unit such as the slice, the tile, the CTU, the CU, etc. When such an adaptive motion vector resolution (AMVR) is applied, information on the motion vector resolution to be applied to each target area should be signaled for each target area. For example, when the target area is the CU, the information on the motion vector resolution applied for each CU is signaled. The information on the motion vector resolution may be information representing precision of a motion vector difference to be described below.

Meanwhile, the inter predictor 124 may perform inter prediction by using bi-prediction. In the case of bi-prediction, two reference pictures and two motion vectors representing a block position most similar to the current block in each reference picture are used. The inter predictor 124 selects a first reference picture and a second reference picture from reference picture list 0 (RefPicList0) and reference picture list 1 (RefPicList1), respectively. The inter predictor 124 also searches blocks most similar to the current block in the respective reference pictures to generate a first reference block and a second reference block. In addition, the prediction block for the current block is generated by averaging or weighted-averaging the first reference block and the second reference block. In addition, motion information including information on two reference pictures used for predicting the current block and information on two motion vectors is delivered to the entropy encoder 155. Here, reference picture list 0 may be constituted by pictures before the current picture in a display order among pre-restored pictures, and reference picture list 1 may be constituted by pictures after the current picture in the display order among the pre-restored pictures. However, although not particularly limited thereto, the pre-restored pictures after the current picture in the display order may be additionally included in reference picture list 0. Inversely, the pre-restored pictures before the current picture may also be additionally included in reference picture list 1.
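
As a minimal sketch of the final averaging step, assuming equal weights for illustration (the actual weighting scheme may differ):

    import numpy as np

    def bi_predict(ref_block0, ref_block1, w0=0.5, w1=0.5):
        # Weighted average of the two reference blocks found in RefPicList0/1.
        return (w0 * np.asarray(ref_block0, dtype=float)
                + w1 * np.asarray(ref_block1, dtype=float))

    pred = bi_predict([[100, 102], [104, 106]], [[96, 98], [100, 102]])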

In order to minimize the bit quantity consumed for encoding the motion information, various methods may be used.

For example, when the reference picture and the motion vector of the current block are the same as the reference picture and the motion vector of the neighboring block, information capable of identifying the neighboring block is encoded to deliver the motion information of the current block to the video decoding apparatus. Such a method is referred to as a merge mode.

In the merge mode, the inter predictor 124 selects a predetermined number of merge candidate blocks (hereinafter, referred to as a “merge candidate”) from the neighboring blocks of the current block.

As a neighboring block for deriving the merge candidate, all or some of a left block A0, a bottom left block A1, a top block B0, a top right block B1, and a top left block B2 adjacent to the current block in the current picture may be used as illustrated in FIG. 4. Further, a block positioned within the reference picture (which may be the same as or different from the reference picture used for predicting the current block) other than the current picture at which the current block is positioned may also be used as the merge candidate. For example, a co-located block with the current block within the reference picture or blocks adjacent to the co-located block may be additionally used as the merge candidate. If the number of merge candidates selected by the method described above is smaller than a preset number, a zero vector is added to the merge candidates.

The inter predictor 124 configures a merge list including a predetermined number of merge candidates by using the neighboring blocks. A merge candidate to be used as the motion information of the current block is selected from the merge candidates included in the merge list, and merge index information for identifying the selected candidate is generated. The generated merge index information is encoded by the entropy encoder 155 and delivered to the video decoding apparatus.

A merge skip mode is a special case of the merge mode. After quantization, when all transform coefficients for entropy encoding are close to zero, only the neighboring block selection information is transmitted without transmitting residual signals. By using the merge skip mode, it is possible to achieve a relatively high encoding efficiency for images with slight motion, still images, screen content images, and the like.

Hereafter, the merge mode and the merge skip mode are collectively referred to as the merge/skip mode.

Another method for encoding the motion information is an advanced motion vector prediction (AMVP) mode.

In the AMVP mode, the inter predictor 124 derives motion vector predictor candidates for the motion vector of the current block by using the neighboring blocks of the current block. As a neighboring block used for deriving the motion vector predictor candidates, all or some of a left block A0, a bottom left block A1, a top block B0, a top right block B1, and a top left block B2 adjacent to the current block in the current picture illustrated in FIG. 4 may be used. Further, a block positioned within the reference picture (which may be the same as or different from the reference picture used for predicting the current block) other than the current picture at which the current block is positioned may also be used as the neighboring block used for deriving the motion vector predictor candidates. For example, a co-located block with the current block within the reference picture or blocks adjacent to the co-located block may be used. If the number of motion vector candidates selected by the method described above is smaller than a preset number, a zero vector is added to the motion vector candidates.

The inter predictor 124 derives the motion vector predictor candidates by using the motion vectors of the neighboring blocks and determines a motion vector predictor for the motion vector of the current block by using the motion vector predictor candidates. In addition, a motion vector difference is calculated by subtracting the motion vector predictor from the motion vector of the current block.

The motion vector predictor may be acquired by applying a pre-defined function (e.g., center value and average value computation, etc.) to the motion vector predictor candidates. In this case, the video decoding apparatus also knows the pre-defined function. Further, since the neighboring block used for deriving the motion vector predictor candidate is a block in which encoding and decoding are already completed, the video decoding apparatus may also already know the motion vector of the neighboring block. Therefore, the video encoding apparatus does not need to encode information for identifying the motion vector predictor candidate. Accordingly, in this case, information on the motion vector difference and information on the reference picture used for predicting the current block are encoded.
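
For illustration, the following sketch uses the component-wise median as one example of such a pre-defined function; the helper name is hypothetical.

    import numpy as np

    def amvp_mvp_and_mvd(mv, mvp_candidates):
        # Both encoder and decoder can derive the same predictor from the
        # candidates, so only the motion vector difference must be coded.
        mvp = np.median(np.asarray(mvp_candidates, dtype=float), axis=0)
        mvd = np.asarray(mv, dtype=float) - mvp
        return mvp, mvd

    mvp, mvd = amvp_mvp_and_mvd((5, -3), [(4, -2), (6, -4), (4, -3)])
    # mvp = [4., -3.], mvd = [1., 0.]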

Meanwhile, the motion vector predictor may also be determined by a scheme of selecting any one of the motion vector predictor candidates. In this case, information for identifying the selected motion vector predictor candidate is additionally encoded jointly with the information on the motion vector difference and the information on the reference picture used for predicting the current block.

The subtractor 130 generates a residual block by subtracting the prediction block generated by the intra predictor 122 or the inter predictor 124 from the current block.

The transformer 140 transforms residual signals in a residual block having pixel values of a spatial domain into transform coefficients of a frequency domain. The transformer 140 may transform residual signals in the residual block by using a total size of the residual block as a transform unit or also split the residual block into a plurality of subblocks and perform the transform by using the subblock as the transform unit. Alternatively, the residual block is divided into two subblocks, which are a transform area and a non-transform area, to transform the residual signals by using only the transform area subblock as the transform unit. Here, the transform area subblock may be one of two rectangular blocks having a size ratio of 1:1 based on a horizontal axis (or vertical axis). In this case, a flag (cu_sbt_flag) indicating that only the subblock is transformed, directional (vertical/horizontal) information (cu_sbt_horizontal_flag), and/or positional information (cu_sbt_pos_flag) are encoded by the entropy encoder 155 and signaled to the video decoding apparatus. Further, a size of the transform area subblock may have a size ratio of 1:3 based on the horizontal axis (or vertical axis). In this case, a flag (cu_sbt_quad_flag) distinguishing the corresponding splitting is additionally encoded by the entropy encoder 155 and signaled to the video decoding apparatus.

Meanwhile, the transformer 140 may perform the transform for the residual block individually in a horizontal direction and a vertical direction. For the transform, various types of transform functions or transform matrices may be used. For example, a pair of transform functions for horizontal transform and vertical transform may be defined as a multiple transform set (MTS). The transformer 140 may select one transform function pair having highest transform efficiency in the MTS and transform the residual block in each of the horizontal and vertical directions. Information (mts_idx) on the transform function pair in the MTS is encoded by the entropy encoder 155 and signaled to the video decoding apparatus.

The quantizer 145 quantizes the transform coefficients output from the transformer 140 using a quantization parameter and outputs the quantized transform coefficients to the entropy encoder 155. The quantizer 145 may also immediately quantize the related residual block without the transform for any block or frame. The quantizer 145 may also apply different quantization coefficients (scaling values) according to positions of the transform coefficients in the transform block. A quantization matrix applied to the quantized transform coefficients arranged in two dimensions may be encoded and signaled to the video decoding apparatus.

The rearrangement unit 150 may perform realignment of coefficient values for the quantized residual values.

The rearrangement unit 150 may change a 2D coefficient array to a 1D coefficient sequence by using coefficient scanning. For example, the rearrangement unit 150 may output the 1D coefficient sequence by scanning from a DC coefficient to a high-frequency domain coefficient by using a zig-zag scan or a diagonal scan. According to the size of the transform unit and the intra prediction mode, vertical scan of scanning a 2D coefficient array in a column direction and horizontal scan of scanning a 2D block type coefficient in a row direction may also be used instead of the zig-zag scan. In other words, according to the size of the transform unit and the intra prediction mode, a scan method to be used may be determined among the zig-zag scan, the diagonal scan, the vertical scan, and the horizontal scan.
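
A minimal sketch of a diagonal scan follows; the exact visiting order within each anti-diagonal is an assumption for illustration.

    import numpy as np

    def diagonal_scan(coeffs):
        # Read the 2D coefficient array along anti-diagonals, starting from
        # the DC coefficient and moving toward high-frequency positions.
        h, w = coeffs.shape
        order = sorted(((y, x) for y in range(h) for x in range(w)),
                       key=lambda p: (p[0] + p[1], p[0]))
        return np.array([coeffs[y, x] for y, x in order])

    seq = diagonal_scan(np.arange(16).reshape(4, 4))  # seq[0] is the DC coefficient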

The entropy encoder 155 generates a bitstream by encoding a sequence of 1D quantized transform coefficients output from the rearrangement unit 150 by using various encoding schemes including a Context-based Adaptive Binary Arithmetic Code (CABAC), an Exponential Golomb, or the like.

Further, the entropy encoder 155 encodes information, such as a CTU size, a CTU split flag, a QT split flag, an MTT split type, an MTT split direction, etc., related to the block splitting to allow the video decoding apparatus to split the block in the same manner as the video encoding apparatus. Further, the entropy encoder 155 encodes information on a prediction type indicating whether the current block is encoded by intra prediction or inter prediction. The entropy encoder 155 encodes intra prediction information (i.e., information on an intra prediction mode) or inter prediction information (a merge index in the case of the merge mode, or information on the reference picture index and the motion vector difference in the case of the AMVP mode) according to the prediction type. Further, the entropy encoder 155 encodes information related to quantization, i.e., information on the quantization parameter and information on the quantization matrix.

The inverse quantizer 160 dequantizes the quantized transform coefficients output from the quantizer 145 to generate the transform coefficients. The inverse transformer 165 transforms the transform coefficients output from the inverse quantizer 160 into a spatial domain from a frequency domain to restore the residual block.

The adder 170 adds the restored residual block and the prediction block generated by the predictor 120 to restore the current block. Pixels in the restored current block may be used as reference pixels when intra-predicting a next-order block.

The loop filter unit 180 performs filtering for the restored pixels in order to reduce blocking artifacts, ringing artifacts, blurring artifacts, etc., which occur due to block based prediction and transform/quantization. The loop filter unit 180 as an in-loop filter may include all or some of a deblocking filter 182, a sample adaptive offset (SAO) filter 184, and an adaptive loop filter (ALF) 186.

The deblocking filter 182 filters a boundary between the restored blocks in order to remove a blocking artifact, which occurs due to block unit encoding/decoding, and the SAO filter 184 and the ALF 186 perform additional filtering for the deblocked filtered video. The SAO filter 184 and the ALF 186 are filters used for compensating differences between the restored pixels and original pixels, which occur due to lossy coding. The SAO filter 184 applies an offset as a CTU unit to enhance a subjective image quality and encoding efficiency. On the other hand, the ALF 186 performs block unit filtering and compensates for distortion by applying different filters according to the boundary of the corresponding block and the degree of change. Information on filter coefficients to be used for the ALF may be encoded and signaled to the video decoding apparatus.

The restored block filtered through the deblocking filter 182, the SAO filter 184, and the ALF 186 is stored in the memory 190. When all blocks in one picture are restored, the restored picture may be used as a reference picture for inter predicting a block within a picture to be encoded afterwards.

FIG. 5 is a functional block diagram of a video decoding apparatus that may implement the technologies of the present disclosure. Hereinafter, referring to FIG. 5, the video decoding apparatus and components of the apparatus are described.

The video decoding apparatus may include an entropy decoder 510, a rearrangement unit 515, an inverse quantizer 520, an inverse transformer 530, a predictor 540, an adder 550, a loop filter unit 560, and a memory 570.

Similar to the video encoding apparatus of FIG. 1, each component of the video decoding apparatus may be implemented as hardware or software or implemented as a combination of hardware and software. Further, a function of each component may be implemented as the software, and a microprocessor may also be implemented to execute the function of the software corresponding to each component.

The entropy decoder 510 extracts information related to block splitting by decoding the bitstream generated by the video encoding apparatus to determine a current block to be decoded and extracts prediction information required for restoring the current block and information on the residual signals.

The entropy decoder 510 determines the size of the CTU by extracting information on the CTU size from a sequence parameter set (SPS) or a picture parameter set (PPS) and splits the picture into CTUs having the determined size. In addition, the CTU is determined as a highest layer of the tree structure, i.e., a root node, and split information for the CTU may be extracted to split the CTU by using the tree structure.

For example, when the CTU is split by using the QTBTTT structure, a first flag (QT_split_flag) related to splitting of the QT is first extracted to split each node into four nodes of the lower layer. In addition, a second flag (mtt_split_flag), a split direction (vertical/horizontal), and/or a split type (binary/ternary) related to splitting of the MTT are extracted with respect to the node corresponding to the leaf node of the QT to split the corresponding leaf node into an MTT structure. As a result, each of the nodes below the leaf node of the QT is recursively split into the BT or TT structure.

As another example, when the CTU is split by using the QTBTTT structure, a CU split flag (split_cu_flag) indicating whether the CU is split is extracted. When the corresponding block is split, the first flag (QT_split_flag) may also be extracted. During a splitting process, with respect to each node, recursive MTT splitting of 0 times or more may occur after recursive QT splitting of 0 times or more. For example, with respect to the CTU, the MTT splitting may immediately occur, or on the contrary, only QT splitting of multiple times may also occur.

As another example, when the CTU is split by using the QTBT structure, the first flag (QT_split_flag) related to the splitting of the QT is extracted to split each node into four nodes of the lower layer. In addition, a split flag (split_flag) indicating whether the node corresponding to the leaf node of the QT is further split into the BT, and split direction information are extracted.

Meanwhile, when the entropy decoder 510 determines a current block to be decoded by using the splitting of the tree structure, the entropy decoder 510 extracts information on a prediction type indicating whether the current block is intra predicted or inter predicted. When the prediction type information indicates the intra prediction, the entropy decoder 510 extracts a syntax element for intra prediction information (intra prediction mode) of the current block. When the prediction type information indicates the inter prediction, the entropy decoder 510 extracts information representing a syntax element for inter prediction information, i.e., a motion vector and a reference picture to which the motion vector refers.

Further, the entropy decoder 510 extracts quantization related information and extracts information on the quantized transform coefficients of the current block as the information on the residual signals.

The rearrangement unit 515 may change a sequence of 1D quantized transform coefficients entropy-decoded by the entropy decoder 510 to a 2D coefficient array (i.e., block) again in a reverse order to the coefficient scanning order performed by the video encoding apparatus.

The inverse quantizer 520 dequantizes the quantized transform coefficients by using the quantization parameter. The inverse quantizer 520 may also apply different quantization coefficients (scaling values) to the quantized transform coefficients arranged in 2D. The inverse quantizer 520 may perform dequantization by applying a matrix of the quantization coefficients (scaling values) from the video encoding apparatus to a 2D array of the quantized transform coefficients.
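
As a sketch of this per-position scaling, assuming a commonly cited step-size approximation (not the exact specification formula):

    import numpy as np

    def dequantize(levels, qp, scaling_matrix):
        # Each quantized level is scaled by a step size derived from the
        # quantization parameter and by its entry in the scaling matrix.
        step = 2.0 ** ((qp - 4) / 6.0)
        return (np.asarray(levels, dtype=float)
                * np.asarray(scaling_matrix, dtype=float) * step)

    coeffs = dequantize([[8, 0], [0, -2]], qp=22, scaling_matrix=[[16, 16], [16, 16]])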

The inverse transformer 530 generates the residual block for the current block by restoring the residual signals by inversely transforming the dequantized transform coefficients into the spatial domain from the frequency domain.

Further, when the inverse transformer 530 inversely transforms a partial area (subblock) of the transform block, the inverse transformer 530 extracts a flag (cu_sbt_flag) indicating that only the subblock of the transform block is transformed, directional (vertical/horizontal) information (cu_sbt_horizontal_flag) of the subblock, and/or positional information (cu_sbt_pos_flag) of the subblock. The inverse transformer 530 also inversely transforms the transform coefficients of the corresponding subblock into the spatial domain from the frequency domain to restore the residual signals and fills an area, which is not inversely transformed, with a value of “0” as the residual signals to generate a final residual block for the current block.

Further, when the MTS is applied, the inverse transformer 530 determines the transform index or the transform matrix to be applied in each of the horizontal and vertical directions by using the MTS information (mts_idx) signaled from the video encoding apparatus. The inverse transformer 530 also performs inverse transform for the transform coefficients in the transform block in the horizontal and vertical directions by using the determined transform function.

The predictor 540 may include an intra predictor 542 and an inter predictor 544. The intra predictor 542 is activated when the prediction type of the current block is the intra prediction, and the inter predictor 544 is activated when the prediction type of the current block is the inter prediction.

The intra predictor 542 determines the intra prediction mode of the current block among the plurality of intra prediction modes from the syntax element for the intra prediction mode extracted from the entropy decoder 510. The intra predictor 542 also predicts the current block by using neighboring reference pixels of the current block according to the intra prediction mode.

The inter predictor 544 determines the motion vector of the current block and the reference picture to which the motion vector refers by using the syntax element for the inter prediction mode extracted from the entropy decoder 510.

The adder 550 restores the current block by adding the residual block output from the inverse transformer 530 and the prediction block output from the inter predictor 544 or the intra predictor 542. Pixels within the restored current block are used as a reference pixel upon intra predicting a block to be decoded afterwards.

The loop filter unit 560 as an in-loop filter may include a deblocking filter 562, an SAO filter 564, and an ALF 566. The deblocking filter 562 performs deblocking filtering on a boundary between the restored blocks in order to remove the blocking artifact, which occurs due to block unit decoding. The SAO filter 564 and the ALF 566 perform additional filtering for the restored block after the deblocking filtering in order to compensate for differences between the restored pixels and original pixels, which occur due to lossy coding. The filter coefficients of the ALF are determined by using information on filter coefficients decoded from the bitstream.

The restored block filtered through the deblocking filter 562, the SAO filter 564, and the ALF 566 is stored in the memory 570. When all blocks in one picture are restored, the restored picture may be used as a reference picture for inter predicting a block within a picture to be decoded afterwards.

The present disclosure in some embodiments relates to encoding and decoding video images as described above. More specifically, the present disclosure provides a video coding method and an apparatus for predicting a chroma component of a current block, by using a deep learning-based matrix operation, from a chroma component spatially adjacent to a chroma block of the current block and from a luma component spatially adjacent to a luma block corresponding to the chroma block.

The following embodiments may be commonly applied to the intra predictor 122 in the video encoding apparatus and the intra predictor 542 in the video decoding apparatus.

In the following description, the term ‘target block’ to be encoded/decoded may be used interchangeably with the current block or coding unit (CU) as described above, or the term ‘target block’ may refer to some area of the coding unit.

Hereinafter, a target block includes a luma block including a luma component and a chroma block including a chroma component. The chroma block of the target block is represented by the target chroma block or the current chroma block. The luma block of the target block is represented by the target luma block or the current luma block.

I. Cross-Component Prediction

In performing prediction in a video encoding/decoding method and an apparatus, a method of generating a prediction block of a current block from a color component that is different from a color component of the target block to be encoded and decoded is defined as a cross-component prediction. In the Versatile Video Coding (VVC) technique, cross-component prediction is used to intra-predict the current chroma block, which is called cross-component linear model (CCLM) prediction. The following describes CCLM prediction, i.e., cross-component prediction using a linear model.

FIG. 6 is an example diagram illustrating the neighboring pixels referenced for cross-component prediction.

To perform cross-component prediction of a target chroma block, left reference pixels and top reference pixels of a luma block corresponding to the target chroma block may be utilized, and left reference pixels and top reference pixels of the target chroma block may be utilized, as illustrated in FIG. 6. Hereinafter, the left reference pixels and the top reference pixels are collectively referred to as reference pixels, neighboring pixels, or adjacent pixels. Furthermore, the reference pixels of the chroma component are represented by chroma reference pixels, and the reference pixels of the luma component are represented by luma reference pixels. In the example of FIG. 6, the size of the chroma block, i.e., the number of pixels, is represented by N×N (where N is a natural number).

In CCLM prediction, a prediction block that is a predictor of the target chroma block is generated by deriving a linear model between the reference pixels of a luma block and the reference pixels of a chroma block and then applying that linear model to the reconstructed pixels of the corresponding luma block.

FIG. 7 is an example diagram conceptually illustrating the derivation of a linear model for cross-component prediction.

In one example, a linear function may be derived based on a minimum value of a neighboring luma pixel, the chroma value co-located with that luma pixel, a maximum value of a neighboring luma pixel, and the chroma value co-located with that luma pixel. In the example of FIG. 7, point A is the ordered pair of (minimum value of the neighboring luma pixels, chroma value co-located with that luma pixel), and point B is the ordered pair of (maximum value of the neighboring luma pixels, chroma value co-located with that luma pixel).
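
A minimal floating-point sketch of this derivation and its application follows; actual codecs use integer arithmetic, and the helper names are illustrative.

    import numpy as np

    def cclm_params(luma_refs, chroma_refs):
        # Point A: (min luma reference, co-located chroma); point B: (max, co-located).
        i_min, i_max = int(np.argmin(luma_refs)), int(np.argmax(luma_refs))
        xA, yA = luma_refs[i_min], chroma_refs[i_min]
        xB, yB = luma_refs[i_max], chroma_refs[i_max]
        a = (yB - yA) / (xB - xA) if xB != xA else 0.0
        b = yA - a * xA
        return a, b                              # linear model: C = a * L + b

    def cclm_predict(rec_luma, a, b):
        # Apply the linear model to the reconstructed (downsampled) luma block.
        return a * np.asarray(rec_luma, dtype=float) + b

    a, b = cclm_params(np.array([72., 80., 96., 120.]), np.array([60., 64., 70., 84.]))
    pred_chroma = cclm_predict(np.full((4, 4), 100.0), a, b)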

In another embodiment, instead of deriving a linear model using only one minimum and one maximum value each, a linear model may be derived using an average of a plurality of minimum values and using an average of a plurality of maximum values. In this case, two or more pixel values may be used as the plurality of minimum and maximum values.

In another embodiment, after deriving one or more linear models, the one or more linear models may be used to perform cross-component estimation of the target chroma block.

For example, when using two linear models, a point C is set as the ordered pair of (median of the neighboring luma pixels, chroma value co-located with that luma pixel). The linear model between point A and point C is defined as the first linear model, and the linear model between point C and point B is defined as the second linear model, allowing different linear models to be applied to the cross-component prediction depending on the range of luma pixel values covered. Thus, depending on the number of medians, cross-component prediction using one or more linear models may use two linear models, three linear models, or more linear models.
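
Reusing cclm_params from the previous sketch, the two-model case may be illustrated as follows, splitting the reference samples at the luma median (point C); this is a simplified sketch, not the normative procedure.

    import numpy as np

    def two_model_cclm_predict(rec_luma, luma_refs, chroma_refs):
        # First model for luma values up to the median, second model above it.
        med = np.median(luma_refs)
        low = luma_refs <= med
        a1, b1 = cclm_params(luma_refs[low], chroma_refs[low])
        a2, b2 = cclm_params(luma_refs[~low], chroma_refs[~low])
        rec = np.asarray(rec_luma, dtype=float)
        return np.where(rec <= med, a1 * rec + b1, a2 * rec + b2)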

Meanwhile, for cross-component prediction using one or more linear models, the video encoding apparatus may signal the number of linear models directly to the video decoding apparatus to indicate how many linear models are to be used for the target block.

Alternatively, the number of linear models to be applied to the current block may be derived based on the size of the current block. For example, for a current block having a size of 32×32 or larger, the video encoding/decoding apparatus may use two linear models to perform cross-component prediction. In this case, the size of 32×32 is an example and is not necessarily limited thereto. Namely, the video encoding/decoding apparatus according to the present embodiment may use a preset size of the current block, such as 32×16, 16×16, or the like, as a basis for determining the number of linear models.

II. Matrix-Based Intra-Cross-Component Prediction

FIG. 8 is an example diagram conceptually illustrating a matrix-based intra-cross-component prediction device, according to at least one embodiment of the present disclosure.

A matrix-based cross-component prediction device (hereinafter, “prediction device”) according to this embodiment utilizes a deep learning-based estimating model that performs matrix operations for generating a chroma prediction block, which is a predictor, from neighboring pixels spatially adjacent to a target chroma block and from neighboring pixels adjacent to a luma block corresponding to the target chroma block. The prediction device includes all or some of an input unit 802, a preprocessor 804, and an estimator 806. Such a prediction device may be common to the intra predictor 122 in the video encoding apparatus and the intra predictor 542 in the video decoding apparatus, as described above. When included in the intra predictor 122 in the video encoding apparatus, the prediction device components included in the video encoding apparatus according to the present embodiment are not necessarily limited to those illustrated. For example, the video encoding apparatus may further include a training unit (not shown) for training the deep learning model included in the prediction device, or the video encoding apparatus may be implemented in conjunction with an external training unit.

The input unit 802 obtains reference pixels. The reference pixels here include, for a target chroma block, chroma reference pixels spatially adjacent to the target chroma block and include luma reference pixels adjacent to a luma block corresponding to the target chroma block. The reference pixels illustrated in FIG. 8 are the same as the reference pixels illustrated in FIG. 6. Thus, the reference pixels may include left neighboring pixels and top neighboring pixels of the chroma block or luma block, as described above. The reference pixels are transferred to the preprocessor 804.

When the input unit 802 obtains the chroma reference pixels of the target chroma block, the input unit 802 may utilize all or some of the left neighboring pixels and the top neighboring pixels depending on the size of the current block. When utilizing only some of the neighboring pixels, the input unit 802 may select them by using a downsampling method, a method of selecting one pixel per certain pixel distance, or the like.

When obtaining the luma reference pixels of the luma block, the input unit 802 may utilize all or some of the left neighboring pixels and the top neighboring pixels depending on the size of the current block. Further, the input unit 802 may determine the locations and values of the luma reference pixels of the luma block according to the color format of the current picture. For example, as illustrated in FIG. 8, the reference pixels are obtained in a YUV 4:2:0 format. As another example, for a YUV 4:2:2 or YUV 4:4:4 format, the input unit 802 may select the reference pixels at locations different from those illustrated in FIG. 8 and determine their values.

In obtaining the reference pixels, the input unit 802 is not limited to using pixels corresponding to one row or one column, as illustrated in FIG. 8. For example, the input unit 802 may use two, three, four, or more rows for the pixels on the top, and two, three, four, or more columns for the pixels on the left.

The preprocessor 804 preprocesses the reference pixels of the target chroma block and the reference pixels of the luma block to generate vectorized reference pixels. The preprocessor 804 may rearrange the reference pixels to generate a 2D vector array, i.e., a matrix. At this time, the preprocessor 804 may separately rearrange the chroma components and luma components of the reference pixels based on the locations of the reference pixels to generate the 2D vector, as illustrated in FIG. 9.

Alternatively, and differently from the example in FIG. 9, the preprocessor 804 may alternately rearrange the chroma components and luma components of the reference pixels to generate a 2D vector. For example, the preprocessor 804 may alternately rearrange the reference pixels in the following order: the top chroma component, the top luma component, the left chroma component, and the left luma component.

In another embodiment, the preprocessor 804 may separately concatenate the chroma components and luma components of the reference pixels to generate a 1D vector. Alternatively, the preprocessor 804 may alternately concatenate the chroma components and luma components of the reference pixels to generate a 1D vector.
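
The following sketch illustrates two of these rearrangement options, the separate-row 2D layout and the separately concatenated 1D layout; the exact layout and the function name are illustrative assumptions.

    import numpy as np

    def rearrange_refs(top_chroma, left_chroma, top_luma, left_luma, mode="2d"):
        chroma = np.concatenate([top_chroma, left_chroma])
        luma = np.concatenate([top_luma, left_luma])
        if mode == "2d":
            # Separate rearrangement: one matrix row per component.
            return np.stack([chroma, luma])
        # Separate concatenation into a single 1D vector.
        return np.concatenate([chroma, luma])

    refs_2d = rearrange_refs(np.ones(8), np.ones(8), np.ones(8), np.ones(8))  # shape (2, 16)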

The rearranged reference pixels, either as a 2D vector or a 1D vector, are transferred to the estimator 806.

The estimator 806 performs cross-component prediction by using the deep learning-based estimating model to generate a chroma prediction block of the current block from the 2D vector or 1D vector of reference pixels. Here, the estimating model represents a deep neural network including one or more neural layers. The estimating model may include, as neural layers, all or some of convolutional layers, fully-connected layers, and pooling layers. The estimating model may be implemented in a form that includes only one type of neural layer, or the estimating model may further include a combination of different types of layers. For example, in one embodiment, the estimating model may be implemented with three convolutional layers, one fully connected layer, and one pooling layer.

The estimating model may take as input a 2D vector, i.e., a matrix, delivered by the preprocessor 804 and may generate a chroma prediction block in matrix form so that matrix-based operations are performed within the estimating model. Additionally, even when a 1D vector is inputted, matrix-based operations may be performed within the estimating model for the estimating model to generate a chroma prediction block in matrix form. In this case, the estimating model generates a chroma prediction block with the same size as the current chroma block.
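
A minimal sketch of such an estimating model follows (PyTorch); the layer composition mirrors the three-convolution, one-pooling, one-fully-connected example above, but all channel widths and sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ChromaEstimator(nn.Module):
        # Maps the rearranged reference pixels (a 2-row matrix from the
        # preprocessing sketch) to an N x N chroma prediction block.
        def __init__(self, in_rows=2, in_cols=16, block_size=8):
            super().__init__()
            self.block_size = block_size
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.AvgPool2d(kernel_size=(1, 2)),
            )
            self.head = nn.Linear(8 * in_rows * (in_cols // 2),
                                  block_size * block_size)

        def forward(self, x):                   # x: (batch, 1, in_rows, in_cols)
            y = self.features(x).flatten(1)     # matrix-based operations
            return self.head(y).view(-1, self.block_size, self.block_size)

    pred = ChromaEstimator()(torch.randn(1, 1, 2, 16))   # (1, 8, 8) chroma predictor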

In another embodiment, a plurality of matrix-formed kernels may be pre-trained to reduce the complexity of the estimation operation when performing deep learning-based cross-component prediction using an estimating model. Using one of the multiple kernels, the estimator 806 may compute a matrix multiplication between an array of inputted reference pixels and the trained kernel. In this case, an index may be utilized to indicate one of the plurality of kernels.
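
For example, with a 1D reference vector, the prediction reduces to a single matrix multiplication with the selected kernel; the kernel shapes and the index scheme below are illustrative assumptions.

    import numpy as np

    def kernel_predict(ref_vec, kernels, kernel_idx, n):
        # Select one pre-trained kernel by index and multiply it with the
        # rearranged reference pixels to obtain an n x n chroma predictor.
        K = kernels[kernel_idx]                 # shape: (len(ref_vec), n * n)
        return (np.asarray(ref_vec, dtype=float) @ K).reshape(n, n)

    kernels = [np.random.rand(32, 64) for _ in range(4)]   # stand-ins for trained kernels
    pred = kernel_predict(np.arange(32), kernels, kernel_idx=2, n=8)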

As described above, the estimator 806 may perform deep learning-based cross-component prediction to generate a chroma prediction block for the current block. The example in FIG. 8 illustrates the prediction device where the inputted current chroma block has a size of 8×8 and the outputted chroma prediction block has the same size of 8×8.

Meanwhile, the estimating model may be pre-trained by the training unit so that the estimating model learns to generate, from the inputted reference pixels, a chroma prediction block that is close to the original chroma block. In this case, one example of a loss function for training may be defined as an L2 metric between the chroma prediction block and the original chroma block. Alternatively, any metric that can represent the difference between the chroma prediction block and the original chroma block may be utilized as the loss function.
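
Using the ChromaEstimator sketch above, such an L2 training loss may be written as follows; the training details are assumptions.

    import torch

    model_out = ChromaEstimator()(torch.randn(1, 1, 2, 16))  # sketch model from above
    target = torch.randn(1, 8, 8)                 # stand-in for the original chroma block
    loss = torch.mean((model_out - target) ** 2)  # L2 metric between predictor and original
    loss.backward()                               # gradients drive the parameter update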

Meanwhile, the parameters of the trained estimating model may be shared between the video encoding apparatus and the video decoding apparatus.

In general, the size of the chroma prediction block as the output, i.e., the number of pixels, can directly affect the complexity and computation of the estimating model. Thus, to reduce the computation of the estimating model, instead of generating a chroma prediction block with the same size as the current chroma block, the prediction device may generate a reduced chroma prediction block with a smaller size than the current chroma block. The prediction device may then post-process the reduced chroma prediction block to generate a chroma prediction block that is interpolated to the same size as the current chroma block.

FIG. 10 is an example diagram conceptually illustrating a matrix-based cross-component prediction device, according to another embodiment of the present disclosure.

A prediction device according to this embodiment uses a deep learning-based estimating model that performs matrix operations to generate a reduced chroma prediction block from reference pixels and then interpolates the reduced chroma prediction block to generate a final chroma prediction block. The prediction device may include an interpolator 1002 in addition to all or some of the input unit 802, the preprocessor 804, and the estimator 806. Such a prediction device may be common to the intra predictor 122 in the video encoding apparatus and the intra predictor 542 in the video decoding apparatus, as described above. When the prediction device is included in the intra predictor 122 in the video encoding apparatus, the components of the prediction device according to the present embodiment are not necessarily limited to those illustrated. For example, the video encoding apparatus may further include a training unit (not shown) for training the deep learning model included in the prediction device, or the video encoding apparatus may be implemented in conjunction with an external training unit.

Hereinafter, the prediction device illustrated in FIG. 10 is described only for its differences from the example of FIG. 8. Thus, the operation of the input unit 802 and the preprocessor 804 remains the same, and a detailed description thereof is omitted.

The estimator 806 performs cross-component prediction by using a deep learning-based estimating model to generate a chroma prediction block of the current block from the reference pixels. The size of the generated chroma prediction block, i.e., the number of pixels, may be different from the number of pixels in the inputted target chroma block. For example, the number of pixels in the chroma prediction block may be smaller than the number of pixels in the target chroma block to reduce the computation of the estimating model. For example, as illustrated in FIG. 10, the estimator 806 may generate the pixels of the chroma prediction block at locations where the pixels of the target chroma block are subsampled by half in the row and column directions, respectively.

The estimator 806 transfers the reduced chroma prediction block to the interpolator 1002.

FIG. 11 is an example diagram illustrating a reduced chroma prediction block, according to at least one embodiment of the present disclosure.

The pixels in the reduced chroma prediction block may be pixels present at subsampled locations in the row or column direction in the target chroma block. As illustrated in FIG. 11, the pixels in the reduced chroma prediction block may be present at locations in the target chroma block that are subsampled in both the row and column directions, locations that are subsampled in the column direction only, and/or locations that are subsampled in the row direction only, etc. Depending on where the pixels of the reduced chroma prediction block are located, the interpolator 1002 may use different interpolation methods.
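
The three subsampling patterns can be written as simple NumPy slices; the 8×8 block below is a hypothetical stand-in for the target chroma block.

import numpy as np

block = np.arange(64).reshape(8, 8)   # stand-in for the target chroma block
both_dirs = block[::2, ::2]           # subsampled in both row and column directions (4x4)
col_only  = block[:, ::2]             # subsampled in the column direction only (8x4)
row_only  = block[::2, :]             # subsampled in the row direction only (4x8)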

The interpolator 1002 generates values between pixels according to a predefined operation so that the size (or number of pixels) of the interpolated chroma prediction block is equal to the size (or number of pixels) of the current chroma block. Thus, the interpolator 1002 generates the interpolated chroma prediction block. Here, the predefined operation refers to filtering the pixels of the reduced chroma prediction block by using an interpolation filter. As the interpolation filter, the interpolator 1002 may utilize a 6-tap interpolation filter, an 8-tap interpolation filter, a bi-linear interpolation filter, or the like.
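
As a hedged sketch, the bi-linear option named above maps directly onto an off-the-shelf upsampling call; the 4×4 reduced block and the 8×8 target size are illustrative, and a 6-tap or 8-tap filter would instead apply a longer filter kernel to the same pixels.

import torch
import torch.nn.functional as F

reduced = torch.randn(1, 1, 4, 4)     # hypothetical reduced chroma prediction block
# Bi-linear interpolation filter up to the 8x8 size of the current chroma block.
interpolated = F.interpolate(reduced, size=(8, 8), mode='bilinear',
                             align_corners=False)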

In performing the above interpolation filtering, the interpolator 1002 may utilize one predefined interpolation filter or may select one of the available interpolation filters by utilizing information of a block adjacent to the current block. In another embodiment, the video encoding apparatus may signal an index indicating the interpolation filter to the video decoding apparatus for each of certain encoding units.

The foregoing embodiments utilize, as reference pixels, neighboring pixels spatially adjacent to the target chroma block and neighboring pixels adjacent to the luma block corresponding to the target chroma block, but the reference pixels are not necessarily limited thereto. For example, to improve the cross-component prediction performance for the target chroma block, the reconstructed pixels of the luma block corresponding to the target chroma block may additionally be utilized as reference pixels.

FIG. 12 is an example diagram conceptually illustrating a cross-component prediction device that further utilizes reconstructed luma pixels, according to another embodiment of the present disclosure.

The prediction device according to this embodiment utilizes a deep learning-based estimating model that performs matrix operations to generate a chroma prediction block from reference pixels and reconstructed pixels in a luma block. The prediction device illustrated in FIG. 12 includes the same components as the example in FIG. 8.

However, in addition to the reference pixels, the input unit 802 may obtain the reconstructed pixels of the luma block corresponding to the target chroma block. The reference pixels here may include chroma reference pixels spatially adjacent to the target chroma block and luma reference pixels adjacent to the luma block corresponding to the target chroma block. Further, the reconstructed pixels of the luma block represent reconstructed pixels before being transferred to the loop filter units 180, 560. The reconstructed pixels may be subsampled as illustrated in FIG. 12. The input unit 802 transfers the obtained reference pixels and the reconstructed pixels to the preprocessor 804.

The preprocessor 804 rearranges the received reference pixels and reconstructed pixels to generate a two-dimensional (2D) vector or a one-dimensional (1D) vector. The preprocessor 804 transfers the 2D vector or 1D vector to the estimator 806.
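
Extending the earlier preprocessing sketch, the reconstructed (pre-loop-filter) luma pixels can simply be appended to the reference-pixel vector; the 16×16 luma block, the 2:1 subsampling, and the vector sizes below are illustrative assumptions.

import numpy as np

recon_luma = np.random.randn(16, 16)        # hypothetical reconstructed luma block
recon_sub  = recon_luma[::2, ::2].ravel()   # subsampled by 2 in each direction (64 pixels)

ref_vector = np.random.randn(32)            # rearranged reference pixels, as before
input_1d = np.concatenate([ref_vector, recon_sub])   # combined 1D input to the estimator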

The prediction device may perform subsequent operations, as in the example of FIG. 8.

FIG. 13 is an example diagram conceptually illustrating a cross-component prediction device that further utilizes reconstructed luma pixels, according to yet another embodiment of the present disclosure.

The prediction device according to this embodiment uses a deep learning-based estimating model that performs matrix operations to generate a reduced chroma prediction block from the reference pixels and the reconstructed pixels in the luma block and then interpolates the reduced chroma prediction block to generate a final chroma prediction block. The prediction device illustrated in FIG. 13 includes the same components as the example in FIG. 10.

However, in addition to the reference pixels, the input unit 802 may obtain reconstructed pixels of the luma block corresponding to the target chroma block. The reference pixels here may include chroma reference pixels spatially adjacent to the target chroma block and may include luma reference pixels adjacent to the luma block corresponding to the target chroma block. Further, the reconstructed pixels in the luma block represent reconstructed pixels before being transferred to the loop filter units 180, 560. The reconstructed pixels may be subsampled as illustrated in FIG. 13. The input unit 802 transfers all obtained pixels to the preprocessor 804.

The preprocessor 804 rearranges the received reference and reconstructed pixels to generate a 2D vector or a 1D vector. The preprocessor 804 transfers the 2D vector or 1D vector to the estimator 806.

The prediction device may perform subsequent operations as in the example of FIG. 10.

Referring now to FIG. 14, a method performed by the prediction device is described for performing a cross-component prediction to predict a chroma component of the current block by using a luma component.

FIG. 14 is a flowchart of a cross-component prediction method, according to at least one embodiment of the present disclosure.

The prediction device obtains reference pixels (S1400). Here, the reference pixels include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block.

When obtaining the chroma reference pixels of the chroma block, the prediction device may utilize all or some of the left neighboring pixels and the top neighboring pixels, depending on the size of the current block. When obtaining the luma reference pixels of the luma block, the prediction device may utilize all or some of the left neighboring pixels and the top neighboring pixels, depending on the size of the current block. Further, the prediction device may determine the locations and values of the luma reference pixels of the luma block based on the color format of the current picture.

The prediction device may obtain the reference pixels from one or more columns adjacent to the left of the chroma block and the luma block and from one or more rows adjacent to the top of the chroma block and the luma block.
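
A sketch of this gathering step is below, assuming a 4:2:0 color format (so the co-located luma block has doubled coordinates and dimensions), single reference lines, and blocks away from the picture boundary; the function and frame names are hypothetical.

import numpy as np

def get_reference_pixels(chroma_frame, luma_frame, x, y, w, h, fmt='4:2:0'):
    """Gather one top row and one left column of neighbors for the w x h chroma
    block at (x, y) and for the co-located luma block (boundary checks omitted)."""
    top_chroma  = chroma_frame[y - 1, x:x + w]
    left_chroma = chroma_frame[y:y + h, x - 1]
    s = 2 if fmt == '4:2:0' else 1     # luma locations scale with the color format
    lx, ly, lw, lh = x * s, y * s, w * s, h * s
    top_luma  = luma_frame[ly - 1, lx:lx + lw]
    left_luma = luma_frame[ly:ly + lh, lx - 1]
    return top_chroma, left_chroma, top_luma, left_luma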

The prediction device may rearrange the reference pixels to generate an input block in the form of a 1D vector or a 2D vector (S1402).

The prediction device may separately rearrange the chroma components and luma components of the reference pixels based on the locations of the reference pixels to generate a 2D vector, i.e., a matrix. Alternatively, the prediction device may alternately rearrange the chroma components and luma components of the reference pixels to generate a 2D vector.

In another embodiment, the prediction device may separately concatenate the chroma components and luma components of the reference pixels to generate a 1D vector. Alternatively, the prediction device may alternately concatenate the chroma components and luma components of the reference pixels to generate a 1D vector.

The prediction device transfers the rearranged input block in the form of the 2D vector or 1D vector to the estimating model.

The prediction device inputs the input block into the deep learning-based estimating model to generate a chroma prediction block of the current block (S1404). The prediction device may input the rearranged input block in the form of the 2D vector or 1D vector into the estimating model to perform cross-component prediction. Here, the estimating model represents a deep neural network including one or more neural layers.

The estimating model may accept a 2D vector, i.e., a matrix, as input to generate a matrix-formed chroma prediction block, allowing matrix-based operations to be performed inside the estimating model. Alternatively, if a 1D vector is inputted, matrix-based operations may be performed inside the estimating model so that the estimating model generates a matrix-formed chroma prediction block. In either case, the estimating model generates a chroma prediction block having the same size as the current chroma block.

Meanwhile, the estimating model may be pre-trained by the training unit to learn to generate, from the inputted reference pixels, a chroma prediction block that approximates the original chroma block. The parameters of the trained estimating model may be shared between the video encoding apparatus and the video decoding apparatus.

Hereinafter, referring to the illustration of FIG. 15, a method performed by the prediction device for cross-component prediction is described for the case where the estimating model generates a reduced chroma prediction block.

FIG. 15 is a flowchart of a cross-component prediction method, according to another embodiment of the present disclosure.

The prediction device obtains reference pixels (S1500). Here, the reference pixels include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block.

The prediction device rearranges the reference pixels to generate an input block in the form of a 1D vector or a 2D vector (S1502).

The prediction device inputs the input block into a deep learning-based estimating model to generate a reduced chroma prediction block of the current block (S1504). The prediction device may input the rearranged input block in the form of the 2D vector or 1D vector into the estimating model to perform cross-component prediction. At this time, to reduce computation, the estimating model generates a reduced chroma prediction block that is smaller than the current chroma block.

The pixels of the reduced chroma prediction block may be pixels present at locations that are subsampled in the row or column direction in the current chroma block.

The prediction device applies predefined interpolation filtering to the pixels of the reduced chroma prediction block to generate an interpolated chroma prediction block (S1506). The prediction device may generate the interpolated chroma prediction block by generating values between pixels according to the interpolation filtering so that the size (or number of pixels) of the interpolated chroma prediction block and the size (or number of pixels) of the current chroma block are the same. Here, interpolation filtering refers to the process of filtering the pixels of the reduced chroma prediction block by using an interpolation filter.

The following describes a method performed by the prediction device for further utilizing the reconstructed pixels of the luma block to perform cross-component prediction.

FIG. 16 is a flowchart of a cross-component prediction method that further utilizes reconstructed pixels in a luma block, according to at least one embodiment of the present disclosure.

The prediction device obtains reference pixels and reconstructed pixels (S1600). Here, the reference pixels include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block. Further, the reconstructed pixels represent the reconstructed pixels of the luma block.

In obtaining the reconstructed pixels of the luma block, the prediction device may utilize all or a subsampled portion of the pixels of the luma block.

The prediction device rearranges the reference pixels and the reconstructed pixels to generate an input block in the form of a 1D vector or a 2D vector (S1602).

The prediction device inputs the input block into a deep learning-based estimating model to generate a chroma prediction block of the current block (S1604). The prediction device may input the rearranged input block in the form of the 2D vector or 1D vector into the estimating model to perform cross-component prediction. Here, the estimating model represents a deep neural network including one or more neural layers.

For the case where the estimating model generates a reduced chroma prediction block, a method is described below where the prediction device further utilizes the reconstructed pixels of the luma block to perform the cross-component prediction.

FIG. 17 is a flowchart of a cross-component prediction method that further utilizes reconstructed pixels in a luma block, according to another embodiment of the present disclosure.

The prediction device obtains reference pixels and reconstructed pixels (S1700). Here, the reference pixels include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block. Further, the reconstructed pixels represent the reconstructed pixels of the luma block.

The prediction device rearranges the reference pixels and the reconstructed pixels to generate an input block in the form of a 1D vector or a 2D vector (S1702).

The prediction device inputs the input block into a deep learning-based estimating model to generate a reduced chroma prediction block of the current block (S1704). The prediction device may input the rearranged input block in the form of the 2D vector or 1D vector into the estimating model to perform cross-component prediction. At this point, to reduce computation, the estimating model generates a reduced chroma prediction block that is smaller in size than the current chroma block.

The prediction device applies predefined interpolation filtering to the pixels of the reduced chroma prediction block to generate an interpolated chroma prediction block (S1706). Here, interpolation filtering refers to filtering the pixels of the reduced chroma prediction block by using an interpolation filter.

Although the steps in the respective flowcharts are described as being sequentially performed, the steps merely instantiate the technical idea of some embodiments of the present disclosure. Therefore, a person having ordinary skill in the art to which this disclosure pertains could perform the steps by changing the sequences described in the respective drawings or by performing two or more of the steps in parallel. Hence, the steps in the respective flowcharts are not limited to the illustrated chronological sequences.

It should be understood that the above description presents illustrative embodiments that may be implemented in various other manners. The functions described in some embodiments may be realized by hardware, software, firmware, and/or their combination. It should also be understood that the functional components described in this specification are labeled by “ . . . unit” to strongly emphasize the possibility of their independent realization.

Meanwhile, various methods or functions described in some embodiments may be implemented as instructions stored in a non-transitory recording medium that can be read and executed by one or more processors. The non-transitory recording medium may include, for example, various types of recording devices in which data is stored in a form readable by a computer system. For example, the non-transitory recording medium may include storage media such as erasable programmable read-only memory (EPROM), flash drive, optical drive, magnetic hard drive, and solid state drive (SSD), among others.

Although embodiments of the present disclosure have been described for illustrative purposes, those having ordinary skill in the art to which this disclosure pertains should appreciate that various modifications, additions, and substitutions are possible without departing from the idea and scope of the present disclosure. Therefore, embodiments of the present disclosure have been described for the sake of brevity and clarity. The scope of the technical idea of the embodiments of the present disclosure is not limited by the illustrations. Accordingly, those having ordinary skill in the art to which this disclosure pertains should understand that the scope of the present disclosure is not limited by the above explicitly described embodiments but by the claims and equivalents thereof.

REFERENCE NUMERALS

-   122: intra predictor
-   542: intra predictor
-   802: input unit
-   804: preprocessor
-   806: estimator
-   1002: interpolator

What is claimed is:
 1. A method performed by a video decoding apparatus for predicting a chroma component of a current block using a luma component, the method comprising: obtaining reference pixels that include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block; generating an input block formed as a one-dimensional (1D) vector or a two-dimensional (2D) vector by rearranging the reference pixels; and generating a chroma prediction block of the current block by inputting the input block into an estimating model that is a deep learning-based model.
 2. The method of claim 1, wherein obtaining the reference pixels comprises: obtaining all or some of left neighboring pixels and top neighboring pixels of the chroma block as the chroma reference pixels.
 3. The method of claim 1, wherein obtaining the reference pixels comprises: obtaining all or some of left neighboring pixels and top neighboring pixels of the luma block as the luma reference pixels.
 4. The method of claim 3, wherein obtaining the reference pixels comprises: determining locations and values of the luma reference pixels according to a color format of a current picture comprising the current block.
 5. The method of claim 1, wherein obtaining the reference pixels comprises: obtaining the reference pixels from one or more columns adjacent to a left side of the chroma block and a left side of the luma block, and one or more rows adjacent to a top of the chroma block and a top of the luma block.
 6. The method of claim 1, wherein generating the input block comprises: rearranging chroma components and luma components of the reference pixels separately and respectively based on locations of the reference pixels.
 7. The method of claim 1, wherein the estimating model is implemented as a deep neural network comprising one or more neural layers and is configured to perform matrix-based operations on the input block.
 8. The method of claim 1, wherein generating the chroma prediction block comprises: causing the estimating model to generate the chroma prediction block to include pixels that are equal in number to pixels of the chroma block.
 9. The method of claim 1, wherein generating the chroma prediction block comprises: causing the estimating model to generate a reduced chroma prediction block to include pixels that are fewer in number than pixels of the chroma block.
 10. The method of claim 9, wherein the pixels of the reduced chroma prediction block are present at locations subsampled in a row or column direction in the chroma block.
 11. The method of claim 9, further comprising: applying a predefined interpolation filtering to the pixels of the reduced chroma prediction block to generate an interpolated chroma prediction block having pixels that are equal in number to the pixels of the chroma block.
 12. A method performed by a video encoding apparatus for predicting a chroma component of a current block using a luma component, the method comprising: obtaining reference pixels that include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block; generating an input block formed as a one-dimensional (1D) vector or a two-dimensional (2D) vector by rearranging the reference pixels; and generating a chroma prediction block of the current block by inputting the input block into an estimating model that is a deep learning-based model.
 13. A computer-readable recording medium storing a bitstream generated by a video encoding method for predicting a chroma component of a current block using a luma component, the video encoding method comprising: obtaining reference pixels that include chroma reference pixels spatially adjacent to a chroma block of the current block and include luma reference pixels adjacent to a luma block corresponding to the chroma block; generating an input block formed as a one-dimensional (1D) vector or a two-dimensional (2D) vector by rearranging the reference pixels; and generating a chroma prediction block of the current block by inputting the input block into an estimating model that is a deep learning-based model.